Author: Paul A. Beata
GitHub: pbeata
Data Source: All these data sets are made up of data from the US government's website called The World Factbook containing information for each country.
Project Goals: To gain insights into similarity between countries and regions of the world through exploratory data analysis and by experimenting with different clusters of countries. The key focus is to think about what these clusters actually represent when segmenting the world's nations into different groups (clusters) based on data alone.
The data set used in this project can be found in the GitHub repository where all of this work is stored: Link to Repository.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%config Completer.use_jedi = False
df = pd.read_csv('./data/CIA_Country_Facts.csv')
df.head()
| Country | Region | Population | Area (sq. mi.) | Pop. Density (per sq. mi.) | Coastline (coast/area ratio) | Net migration | Infant mortality (per 1000 births) | GDP ($ per capita) | Literacy (%) | Phones (per 1000) | Arable (%) | Crops (%) | Other (%) | Climate | Birthrate | Deathrate | Agriculture | Industry | Service | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | ASIA (EX. NEAR EAST) | 31056997 | 647500 | 48.0 | 0.00 | 23.06 | 163.07 | 700.0 | 36.0 | 3.2 | 12.13 | 0.22 | 87.65 | 1.0 | 46.60 | 20.34 | 0.380 | 0.240 | 0.380 |
| 1 | Albania | EASTERN EUROPE | 3581655 | 28748 | 124.6 | 1.26 | -4.93 | 21.52 | 4500.0 | 86.5 | 71.2 | 21.09 | 4.42 | 74.49 | 3.0 | 15.11 | 5.22 | 0.232 | 0.188 | 0.579 |
| 2 | Algeria | NORTHERN AFRICA | 32930091 | 2381740 | 13.8 | 0.04 | -0.39 | 31.00 | 6000.0 | 70.0 | 78.1 | 3.22 | 0.25 | 96.53 | 1.0 | 17.14 | 4.61 | 0.101 | 0.600 | 0.298 |
| 3 | American Samoa | OCEANIA | 57794 | 199 | 290.4 | 58.29 | -20.71 | 9.27 | 8000.0 | 97.0 | 259.5 | 10.00 | 15.00 | 75.00 | 2.0 | 22.46 | 3.27 | NaN | NaN | NaN |
| 4 | Andorra | WESTERN EUROPE | 71201 | 468 | 152.1 | 0.00 | 6.60 | 4.05 | 19000.0 | 100.0 | 497.2 | 2.22 | 0.00 | 97.78 | 3.0 | 8.71 | 6.25 | NaN | NaN | NaN |
Here we can also explore the rows and columns of the data, as well as the data types of the columns.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 227 entries, 0 to 226 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Country 227 non-null object 1 Region 227 non-null object 2 Population 227 non-null int64 3 Area (sq. mi.) 227 non-null int64 4 Pop. Density (per sq. mi.) 227 non-null float64 5 Coastline (coast/area ratio) 227 non-null float64 6 Net migration 224 non-null float64 7 Infant mortality (per 1000 births) 224 non-null float64 8 GDP ($ per capita) 226 non-null float64 9 Literacy (%) 209 non-null float64 10 Phones (per 1000) 223 non-null float64 11 Arable (%) 225 non-null float64 12 Crops (%) 225 non-null float64 13 Other (%) 225 non-null float64 14 Climate 205 non-null float64 15 Birthrate 224 non-null float64 16 Deathrate 223 non-null float64 17 Agriculture 212 non-null float64 18 Industry 211 non-null float64 19 Service 212 non-null float64 dtypes: float64(16), int64(2), object(2) memory usage: 35.6+ KB
# check for missing values
df.isnull().sum()
Country 0 Region 0 Population 0 Area (sq. mi.) 0 Pop. Density (per sq. mi.) 0 Coastline (coast/area ratio) 0 Net migration 3 Infant mortality (per 1000 births) 3 GDP ($ per capita) 1 Literacy (%) 18 Phones (per 1000) 4 Arable (%) 2 Crops (%) 2 Other (%) 2 Climate 22 Birthrate 3 Deathrate 4 Agriculture 15 Industry 16 Service 15 dtype: int64
# check our continuous (numerical) features
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Population | 227.0 | 2.874028e+07 | 1.178913e+08 | 7026.000 | 437624.00000 | 4786994.000 | 1.749777e+07 | 1.313974e+09 |
| Area (sq. mi.) | 227.0 | 5.982270e+05 | 1.790282e+06 | 2.000 | 4647.50000 | 86600.000 | 4.418110e+05 | 1.707520e+07 |
| Pop. Density (per sq. mi.) | 227.0 | 3.790471e+02 | 1.660186e+03 | 0.000 | 29.15000 | 78.800 | 1.901500e+02 | 1.627150e+04 |
| Coastline (coast/area ratio) | 227.0 | 2.116533e+01 | 7.228686e+01 | 0.000 | 0.10000 | 0.730 | 1.034500e+01 | 8.706600e+02 |
| Net migration | 224.0 | 3.812500e-02 | 4.889269e+00 | -20.990 | -0.92750 | 0.000 | 9.975000e-01 | 2.306000e+01 |
| Infant mortality (per 1000 births) | 224.0 | 3.550696e+01 | 3.538990e+01 | 2.290 | 8.15000 | 21.000 | 5.570500e+01 | 1.911900e+02 |
| GDP ($ per capita) | 226.0 | 9.689823e+03 | 1.004914e+04 | 500.000 | 1900.00000 | 5550.000 | 1.570000e+04 | 5.510000e+04 |
| Literacy (%) | 209.0 | 8.283828e+01 | 1.972217e+01 | 17.600 | 70.60000 | 92.500 | 9.800000e+01 | 1.000000e+02 |
| Phones (per 1000) | 223.0 | 2.360614e+02 | 2.279918e+02 | 0.200 | 37.80000 | 176.200 | 3.896500e+02 | 1.035600e+03 |
| Arable (%) | 225.0 | 1.379711e+01 | 1.304040e+01 | 0.000 | 3.22000 | 10.420 | 2.000000e+01 | 6.211000e+01 |
| Crops (%) | 225.0 | 4.564222e+00 | 8.361470e+00 | 0.000 | 0.19000 | 1.030 | 4.440000e+00 | 5.068000e+01 |
| Other (%) | 225.0 | 8.163831e+01 | 1.614083e+01 | 33.330 | 71.65000 | 85.700 | 9.544000e+01 | 1.000000e+02 |
| Climate | 205.0 | 2.139024e+00 | 6.993968e-01 | 1.000 | 2.00000 | 2.000 | 3.000000e+00 | 4.000000e+00 |
| Birthrate | 224.0 | 2.211473e+01 | 1.117672e+01 | 7.290 | 12.67250 | 18.790 | 2.982000e+01 | 5.073000e+01 |
| Deathrate | 223.0 | 9.241345e+00 | 4.990026e+00 | 2.290 | 5.91000 | 7.840 | 1.060500e+01 | 2.974000e+01 |
| Agriculture | 212.0 | 1.508443e-01 | 1.467980e-01 | 0.000 | 0.03775 | 0.099 | 2.210000e-01 | 7.690000e-01 |
| Industry | 211.0 | 2.827109e-01 | 1.382722e-01 | 0.020 | 0.19300 | 0.272 | 3.410000e-01 | 9.060000e-01 |
| Service | 212.0 | 5.652830e-01 | 1.658410e-01 | 0.062 | 0.42925 | 0.571 | 6.785000e-01 | 9.540000e-01 |
Here we create visualizations to explore the data further.
Box Plot of Population
# we could have some outliers in the data
plt.figure(figsize=(10,5), dpi=200)
sns.boxplot(data=df, x='Population', y='Region')
plt.show()
# if we reduce the range of the box plot above, we will lose India and China as they are extreme outliers for Asia
# (see new box plot below)
df.sort_values(by='Population', ascending=False)[['Country', 'Population']]
| Country | Population | |
|---|---|---|
| 42 | China | 1313973713 |
| 94 | India | 1095351995 |
| 214 | United States | 298444215 |
| 95 | Indonesia | 245452739 |
| 27 | Brazil | 188078227 |
| ... | ... | ... |
| 144 | Nauru | 13287 |
| 209 | Tuvalu | 11810 |
| 140 | Montserrat | 9439 |
| 171 | Saint Helena | 7502 |
| 174 | St Pierre & Miquelon | 7026 |
227 rows × 2 columns
# we could have some outliers in the data
plt.figure(figsize=(10,5), dpi=200)
sns.boxplot(data=df, x='Population', y='Region')
plt.xlim(0, 0.4e9)
plt.show()
Histogram of Population
sns.histplot(data=df, x='Population');
Using our knowledge of outliers from above, we will zoom in to only show the populations of countries that have less than 4 billion people.
sns.histplot(data=df[df['Population'] < 0.4e9], x='Population');
GDP and Regions
Here we create a bar chart showing the mean GDP per Capita per region.
plt.figure(figsize=(8,5), dpi=100)
sns.barplot(data=df, x='Region', y='GDP ($ per capita)')
plt.xticks(rotation=90)
plt.title('GDP in \\$USD per Capita by Region')
plt.show()
Here we plot the relationship between phones per 1000 people and the GDP per capita, with the data points colored by region.
def fig(x=8, y=4, d=100):
return plt.figure(figsize=(x,y), dpi=d)
fig(y=6)
sns.scatterplot(data=df,
x='GDP ($ per capita)',
y='Phones (per 1000)',
hue='Region',
style='Region')
# plt.xscale("log")
plt.legend(loc=(1.05, 0.25))
plt.show()
Here we plot the relationship between GDP per capita and literacy, with the data points colored by region.
fig(y=6, d=200)
sns.scatterplot(data=df,
x='GDP ($ per capita)',
y='Literacy (%)',
hue='Region',
style='Region')
# plt.xscale("log")
# plt.yscale("log")
plt.legend(loc=(1.05, 0.25))
plt.show()
Obervations
Heatmap of the Correlation Among Features
sns.heatmap(data=df.corr(), cmap='rainbow');
Clustering
For clustering, we know that Seaborn can automatically perform hierarchal clustering through the clustermap() function. So here we will start by creating a clustermap of the correlations between each feature of the data set with this function.
sns.clustermap(data=df.corr());
Now we will prepare our data for K-Means Clustering with scikit-learn.
# missing values by column
df.isnull().sum()
Country 0 Region 0 Population 0 Area (sq. mi.) 0 Pop. Density (per sq. mi.) 0 Coastline (coast/area ratio) 0 Net migration 3 Infant mortality (per 1000 births) 3 GDP ($ per capita) 1 Literacy (%) 18 Phones (per 1000) 4 Arable (%) 2 Crops (%) 2 Other (%) 2 Climate 22 Birthrate 3 Deathrate 4 Agriculture 15 Industry 16 Service 15 dtype: int64
There are 15 countries with missing values for "Agriculture". What is the main aspect of these countries?
df[df['Agriculture'].isnull()]['Country']
3 American Samoa 4 Andorra 78 Gibraltar 80 Greenland 83 Guam 134 Mayotte 140 Montserrat 144 Nauru 153 N. Mariana Islands 171 Saint Helena 174 St Pierre & Miquelon 177 San Marino 208 Turks & Caicos Is 221 Wallis and Futuna 223 Western Sahara Name: Country, dtype: object
Many of these countries that have a NaN value for agriculture are islands. Others are smaller regions of other countries, such as San Marino (Italy) and Gibraltar (British Territory), so it's likely that their agriculture is quite small or virtually non-existent. Greenland is the one exception, but due to its climate and from a quick Google search, we realize that "... the lack of agriculture, forestry and similar countryside activities has meant that very few country roads have been built" (Wikipedia page for Greenland), and thus there is essentially no agriculture industry there either. For these countries, we will fill their agriculture values with zero:
# fill missing values in the agriculture column with zero
df['Agriculture'].fillna(0, inplace=True)
df['Agriculture'].isnull().sum()
0
TASK: Now check to see what is still missing by counting number of missing elements again per feature:
df.isnull().sum()
Country 0 Region 0 Population 0 Area (sq. mi.) 0 Pop. Density (per sq. mi.) 0 Coastline (coast/area ratio) 0 Net migration 3 Infant mortality (per 1000 births) 3 GDP ($ per capita) 1 Literacy (%) 18 Phones (per 1000) 4 Arable (%) 2 Crops (%) 2 Other (%) 2 Climate 22 Birthrate 3 Deathrate 4 Agriculture 0 Industry 16 Service 15 dtype: int64
TASK: Notice climate is missing for a few countries, but not the Region! Let's use this to our advantage. Fill in the missing Climate values based on the mean climate value for its region.
Hints on how to do this: https://stackoverflow.com/questions/19966018/pandas-filling-missing-values-by-mean-in-each-group
df.groupby('Region').mean()['Climate']
Region ASIA (EX. NEAR EAST) 1.962963 BALTICS 3.000000 C.W. OF IND. STATES 2.550000 EASTERN EUROPE 3.111111 LATIN AMER. & CARIB 2.033333 NEAR EAST 1.666667 NORTHERN AFRICA 1.500000 NORTHERN AMERICA 2.000000 OCEANIA 2.000000 SUB-SAHARAN AFRICA 1.885417 WESTERN EUROPE 3.095238 Name: Climate, dtype: float64
df['Climate'] = df.groupby('Region').transform(lambda x: x.fillna(x.mean()))
Check again on many elements are missing:
df.isnull().sum()
Country 0 Region 0 Population 0 Area (sq. mi.) 0 Pop. Density (per sq. mi.) 0 Coastline (coast/area ratio) 0 Net migration 3 Infant mortality (per 1000 births) 3 GDP ($ per capita) 1 Literacy (%) 18 Phones (per 1000) 4 Arable (%) 2 Crops (%) 2 Other (%) 2 Climate 0 Birthrate 3 Deathrate 4 Agriculture 0 Industry 16 Service 15 dtype: int64
TASK: It looks like Literacy percentage is missing. Use the same tactic as we did with Climate missing values and fill in any missing Literacy % values with the mean Literacy % of the Region.
df['Literacy (%)'] = df.groupby('Region').transform(lambda x: x.fillna(x.mean()))
df.isnull().sum()
Country 0 Region 0 Population 0 Area (sq. mi.) 0 Pop. Density (per sq. mi.) 0 Coastline (coast/area ratio) 0 Net migration 3 Infant mortality (per 1000 births) 3 GDP ($ per capita) 1 Literacy (%) 0 Phones (per 1000) 4 Arable (%) 2 Crops (%) 2 Other (%) 2 Climate 0 Birthrate 3 Deathrate 4 Agriculture 0 Industry 16 Service 15 dtype: int64
In this section I will handle missing values by updating our DataFrame with the current 2021 estimates listed on the cia.gov website for the following features:
Net Migration
df[df['Net migration'].isnull()]['Country']
47 Cook Islands 221 Wallis and Futuna 223 Western Sahara Name: Country, dtype: object
# we found the updated values for two of the countries:
df.loc[47, 'Net migration'] = -28.58
df.loc[221, 'Net migration'] = -4.16
# for the remaining countries, we can use the average for that region:
df['Net migration'] = df.groupby('Region').transform(lambda x: x.fillna(x.mean()))
# check for missing values
df.isnull().sum()
Country 0 Region 0 Population 0 Area (sq. mi.) 0 Pop. Density (per sq. mi.) 0 Coastline (coast/area ratio) 0 Net migration 0 Infant mortality (per 1000 births) 3 GDP ($ per capita) 1 Literacy (%) 0 Phones (per 1000) 4 Arable (%) 2 Crops (%) 2 Other (%) 2 Climate 0 Birthrate 3 Deathrate 4 Agriculture 0 Industry 16 Service 15 dtype: int64
Infant mortality (per 1000 births)
df[df['Infant mortality (per 1000 births)'].isnull()]['Country']
47 Cook Islands 221 Wallis and Futuna 223 Western Sahara Name: Country, dtype: object
# we found the updated values for two of the countries:
df.loc[47, 'Infant mortality (per 1000 births)'] = 16.33
df.loc[221, 'Infant mortality (per 1000 births)'] = 4.06
# for the remaining countries, we can use the average for that region:
df['Infant mortality (per 1000 births)'] = df.groupby('Region').transform(lambda x: x.fillna(x.mean()))
# check for missing values
df.isnull().sum()
Country 0 Region 0 Population 0 Area (sq. mi.) 0 Pop. Density (per sq. mi.) 0 Coastline (coast/area ratio) 0 Net migration 0 Infant mortality (per 1000 births) 0 GDP ($ per capita) 1 Literacy (%) 0 Phones (per 1000) 4 Arable (%) 2 Crops (%) 2 Other (%) 2 Climate 0 Birthrate 3 Deathrate 4 Agriculture 0 Industry 16 Service 15 dtype: int64
GDP ($ per capita)
df[df['GDP ($ per capita)'].isnull()]['Country']
223 Western Sahara Name: Country, dtype: object
# since Western Sahara has had so many missing values and we were not able to find any updated values,
# we have decided to drop this country from the data set
df = df.drop([223], axis=0)
df.isnull().sum()
Country 0 Region 0 Population 0 Area (sq. mi.) 0 Pop. Density (per sq. mi.) 0 Coastline (coast/area ratio) 0 Net migration 0 Infant mortality (per 1000 births) 0 GDP ($ per capita) 0 Literacy (%) 0 Phones (per 1000) 3 Arable (%) 2 Crops (%) 2 Other (%) 2 Climate 0 Birthrate 2 Deathrate 3 Agriculture 0 Industry 15 Service 15 dtype: int64
Birthrate and Deathrate
df[df['Birthrate'].isnull()]['Country']
181 Serbia 221 Wallis and Futuna Name: Country, dtype: object
df.loc[181, 'Birthrate'] = 8.74
df.loc[221, 'Birthrate'] = 12.49
df[df['Deathrate'].isnull()]['Country']
47 Cook Islands 181 Serbia 221 Wallis and Futuna Name: Country, dtype: object
df.loc[47, 'Deathrate'] = 8.89
df.loc[181, 'Deathrate'] = 13.49
df.loc[221, 'Deathrate'] = 5.74
df.isnull().sum()
Country 0 Region 0 Population 0 Area (sq. mi.) 0 Pop. Density (per sq. mi.) 0 Coastline (coast/area ratio) 0 Net migration 0 Infant mortality (per 1000 births) 0 GDP ($ per capita) 0 Literacy (%) 0 Phones (per 1000) 3 Arable (%) 2 Crops (%) 2 Other (%) 2 Climate 0 Birthrate 0 Deathrate 0 Agriculture 0 Industry 15 Service 15 dtype: int64
Arable, Crops, Other
df[df['Arable (%)'].isnull()]['Country']
85 Guernsey 134 Mayotte Name: Country, dtype: object
df[df['Crops (%)'].isnull()]['Country']
85 Guernsey 134 Mayotte Name: Country, dtype: object
df[df['Other (%)'].isnull()]['Country']
85 Guernsey 134 Mayotte Name: Country, dtype: object
# we are going to drop these two countries as well
df = df.drop([85, 134], axis=0)
df.isnull().sum()
Country 0 Region 0 Population 0 Area (sq. mi.) 0 Pop. Density (per sq. mi.) 0 Coastline (coast/area ratio) 0 Net migration 0 Infant mortality (per 1000 births) 0 GDP ($ per capita) 0 Literacy (%) 0 Phones (per 1000) 3 Arable (%) 0 Crops (%) 0 Other (%) 0 Climate 0 Birthrate 0 Deathrate 0 Agriculture 0 Industry 14 Service 14 dtype: int64
Phones (per 1000)
df[df['Phones (per 1000)'].isnull()]['Country']
52 Cyprus 58 East Timor 140 Montserrat Name: Country, dtype: object
df['Phones (per 1000)'] = df.groupby('Region').transform(lambda x: x.fillna(x.mean()))
df.isnull().sum()
Country 0 Region 0 Population 0 Area (sq. mi.) 0 Pop. Density (per sq. mi.) 0 Coastline (coast/area ratio) 0 Net migration 0 Infant mortality (per 1000 births) 0 GDP ($ per capita) 0 Literacy (%) 0 Phones (per 1000) 0 Arable (%) 0 Crops (%) 0 Other (%) 0 Climate 0 Birthrate 0 Deathrate 0 Agriculture 0 Industry 14 Service 14 dtype: int64
Industry and Service
df[['Industry', 'Service']]
| Industry | Service | |
|---|---|---|
| 0 | 0.240 | 0.380 |
| 1 | 0.188 | 0.579 |
| 2 | 0.600 | 0.298 |
| 3 | NaN | NaN |
| 4 | NaN | NaN |
| ... | ... | ... |
| 221 | NaN | NaN |
| 222 | 0.280 | 0.630 |
| 224 | 0.472 | 0.393 |
| 225 | 0.290 | 0.489 |
| 226 | 0.243 | 0.579 |
224 rows × 2 columns
df[['Industry', 'Service']].describe()
| Industry | Service | |
|---|---|---|
| count | 210.000000 | 210.000000 |
| mean | 0.283581 | 0.564619 |
| std | 0.138022 | 0.164897 |
| min | 0.020000 | 0.062000 |
| 25% | 0.196500 | 0.431500 |
| 50% | 0.272500 | 0.571000 |
| 75% | 0.341500 | 0.677500 |
| max | 0.906000 | 0.954000 |
df.groupby('Region').mean()[['Industry', 'Service']]
| Industry | Service | |
|---|---|---|
| Region | ||
| ASIA (EX. NEAR EAST) | 0.302143 | 0.520107 |
| BALTICS | 0.293333 | 0.661667 |
| C.W. OF IND. STATES | 0.328000 | 0.480167 |
| EASTERN EUROPE | 0.309250 | 0.598667 |
| LATIN AMER. & CARIB | 0.256116 | 0.650721 |
| NEAR EAST | 0.406000 | 0.530000 |
| NORTHERN AFRICA | 0.426200 | 0.438400 |
| NORTHERN AMERICA | 0.199333 | 0.787000 |
| OCEANIA | 0.215250 | 0.608937 |
| SUB-SAHARAN AFRICA | 0.266878 | 0.449755 |
| WESTERN EUROPE | 0.252435 | 0.707870 |
fig()
sns.scatterplot(data=df, x='Industry', y='Service', hue='Region')
plt.legend(loc=(1.05, 0.25))
plt.show()
df['Industry'] = df.groupby('Region').transform(lambda x: x.fillna(x.mean()))
df['Service'] = df.groupby('Region').transform(lambda x: x.fillna(x.mean()))
df.isnull().sum()
Country 0 Region 0 Population 0 Area (sq. mi.) 0 Pop. Density (per sq. mi.) 0 Coastline (coast/area ratio) 0 Net migration 0 Infant mortality (per 1000 births) 0 GDP ($ per capita) 0 Literacy (%) 0 Phones (per 1000) 0 Arable (%) 0 Crops (%) 0 Other (%) 0 Climate 0 Birthrate 0 Deathrate 0 Agriculture 0 Industry 0 Service 0 dtype: int64
# make sure there are no missing values
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 224 entries, 0 to 226 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Country 224 non-null object 1 Region 224 non-null object 2 Population 224 non-null int64 3 Area (sq. mi.) 224 non-null int64 4 Pop. Density (per sq. mi.) 224 non-null float64 5 Coastline (coast/area ratio) 224 non-null float64 6 Net migration 224 non-null float64 7 Infant mortality (per 1000 births) 224 non-null float64 8 GDP ($ per capita) 224 non-null float64 9 Literacy (%) 224 non-null float64 10 Phones (per 1000) 224 non-null float64 11 Arable (%) 224 non-null float64 12 Crops (%) 224 non-null float64 13 Other (%) 224 non-null float64 14 Climate 224 non-null float64 15 Birthrate 224 non-null float64 16 Deathrate 224 non-null float64 17 Agriculture 224 non-null float64 18 Industry 224 non-null float64 19 Service 224 non-null float64 dtypes: float64(16), int64(2), object(2) memory usage: 36.8+ KB
# CHECKPOINT
df_clean = df.copy()
Prepare the data for clustering: the "Country" column is still a unique identifier string, so it won't be useful for clustering, since its unique for each point. Therefore, we can drop the "Country" column now:
# drop the country column
df_countries = df_clean.copy()
df_clean = df_clean.drop('Country', axis=1)
df_clean.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 224 entries, 0 to 226 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Region 224 non-null object 1 Population 224 non-null int64 2 Area (sq. mi.) 224 non-null int64 3 Pop. Density (per sq. mi.) 224 non-null float64 4 Coastline (coast/area ratio) 224 non-null float64 5 Net migration 224 non-null float64 6 Infant mortality (per 1000 births) 224 non-null float64 7 GDP ($ per capita) 224 non-null float64 8 Literacy (%) 224 non-null float64 9 Phones (per 1000) 224 non-null float64 10 Arable (%) 224 non-null float64 11 Crops (%) 224 non-null float64 12 Other (%) 224 non-null float64 13 Climate 224 non-null float64 14 Birthrate 224 non-null float64 15 Deathrate 224 non-null float64 16 Agriculture 224 non-null float64 17 Industry 224 non-null float64 18 Service 224 non-null float64 dtypes: float64(16), int64(2), object(1) memory usage: 35.0+ KB
Now we can create the X array of features. Since the "Region" column still contains categorical strings, we will use Pandas to create dummy variables from this column to create a finalzed X array of continuous features along with dummy variables to replace the categorical regions.
len(df_clean['Region'].unique())
11
# this will transform the single region column in 11 new dummy variables (one for each unique region)
X = pd.get_dummies(df_clean)
X
| Population | Area (sq. mi.) | Pop. Density (per sq. mi.) | Coastline (coast/area ratio) | Net migration | Infant mortality (per 1000 births) | GDP ($ per capita) | Literacy (%) | Phones (per 1000) | Arable (%) | ... | Region_BALTICS | Region_C.W. OF IND. STATES | Region_EASTERN EUROPE | Region_LATIN AMER. & CARIB | Region_NEAR EAST | Region_NORTHERN AFRICA | Region_NORTHERN AMERICA | Region_OCEANIA | Region_SUB-SAHARAN AFRICA | Region_WESTERN EUROPE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31056997 | 647500 | 48.0 | 0.00 | 31056997.0 | 31056997.0 | 700.0 | 31056997.0 | 31056997.0 | 12.13 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 3581655 | 28748 | 124.6 | 1.26 | 3581655.0 | 3581655.0 | 4500.0 | 3581655.0 | 3581655.0 | 21.09 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 32930091 | 2381740 | 13.8 | 0.04 | 32930091.0 | 32930091.0 | 6000.0 | 32930091.0 | 32930091.0 | 3.22 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 57794 | 199 | 290.4 | 58.29 | 57794.0 | 57794.0 | 8000.0 | 57794.0 | 57794.0 | 10.00 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 71201 | 468 | 152.1 | 0.00 | 71201.0 | 71201.0 | 19000.0 | 71201.0 | 71201.0 | 2.22 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 221 | 16025 | 274 | 58.5 | 47.08 | 16025.0 | 16025.0 | 3700.0 | 16025.0 | 16025.0 | 5.00 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 222 | 2460492 | 5860 | 419.9 | 0.00 | 2460492.0 | 2460492.0 | 800.0 | 2460492.0 | 2460492.0 | 16.90 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 224 | 21456188 | 527970 | 40.6 | 0.36 | 21456188.0 | 21456188.0 | 800.0 | 21456188.0 | 21456188.0 | 2.78 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 225 | 11502010 | 752614 | 15.3 | 0.00 | 11502010.0 | 11502010.0 | 800.0 | 11502010.0 | 11502010.0 | 7.08 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 226 | 12236805 | 390580 | 31.3 | 0.00 | 12236805.0 | 12236805.0 | 1900.0 | 12236805.0 | 12236805.0 | 8.32 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
224 rows × 29 columns
X.columns
Index(['Population', 'Area (sq. mi.)', 'Pop. Density (per sq. mi.)',
'Coastline (coast/area ratio)', 'Net migration',
'Infant mortality (per 1000 births)', 'GDP ($ per capita)',
'Literacy (%)', 'Phones (per 1000)', 'Arable (%)', 'Crops (%)',
'Other (%)', 'Climate', 'Birthrate', 'Deathrate', 'Agriculture',
'Industry', 'Service', 'Region_ASIA (EX. NEAR EAST) ',
'Region_BALTICS ',
'Region_C.W. OF IND. STATES ',
'Region_EASTERN EUROPE ',
'Region_LATIN AMER. & CARIB ',
'Region_NEAR EAST ',
'Region_NORTHERN AFRICA ',
'Region_NORTHERN AMERICA ',
'Region_OCEANIA ',
'Region_SUB-SAHARAN AFRICA ',
'Region_WESTERN EUROPE '],
dtype='object')
Due to some measurements being in terms of percentages and other metrics being total counts (population), we should scale this data first.
from sklearn.preprocessing import StandardScaler
# define, fit, and transform using the scaler here:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
For our clustering analysis, we will use a for loop to create and fit multiple KMeans models by testing from K=2 to K=30 clusters. We store the "sum of squared distances" (SSD) for each K value and then plot this out to create an elbow plot of K versus SSD.
We could also create a bar plot showing the SSD difference from when using K+1 veruse K clusters.
from sklearn.cluster import KMeans
ssd = []
kmax = 31
# loop over a variety of K values
for K in range(2, kmax):
# build Kmeans model using current value of K
model = KMeans(n_clusters=K)
# fit the model and assign clusters using the scaled features
labels = model.fit_predict(X_scaled)
# store the labels with our feature DataFrame
kstring = "Cluster K" + str(K)
df_countries[kstring] = labels
# store the sum of squared distances
ssd.append(model.inertia_)
fig()
plt.plot(range(2, kmax), ssd, 'ko--');
plt.title('Elbow Plot for K-Means Clustering')
plt.xlabel('K Value')
plt.ylabel('Sum of Squared Distance')
plt.xticks(np.arange(2, kmax, 2))
plt.show()
# determine a good cut-off point for the clusters
pd.Series(ssd).diff()
0 NaN 1 -603.249291 2 -379.807863 3 -289.764405 4 -288.281574 5 -325.556874 6 -231.627464 7 -252.203731 8 -264.502233 9 -237.117823 10 -248.381109 11 -179.168694 12 -138.034346 13 -88.195927 14 -74.597776 15 -59.521360 16 -105.331566 17 -44.064877 18 -41.412093 19 -64.996553 20 -35.726171 21 -29.857266 22 -47.301780 23 -27.170701 24 -35.685507 25 -41.830524 26 -16.013829 27 -42.858342 28 -22.339089 dtype: float64
ssd_col = pd.Series(ssd)
diff_col = pd.Series(ssd).diff()
k_values = np.arange(2, kmax)
ssd_data = pd.concat([ssd_col, diff_col], axis=1)
# pd.DataFrame(ssd_data)
ssd_data.columns = ['SSD', 'DIFF']
ssd_data['K-Values'] = k_values
ssd_data = ssd_data[['K-Values', 'SSD', 'DIFF']]
ssd_data
| K-Values | SSD | DIFF | |
|---|---|---|---|
| 0 | 2 | 4856.751461 | NaN |
| 1 | 3 | 4253.502170 | -603.249291 |
| 2 | 4 | 3873.694307 | -379.807863 |
| 3 | 5 | 3583.929902 | -289.764405 |
| 4 | 6 | 3295.648329 | -288.281574 |
| 5 | 7 | 2970.091455 | -325.556874 |
| 6 | 8 | 2738.463991 | -231.627464 |
| 7 | 9 | 2486.260260 | -252.203731 |
| 8 | 10 | 2221.758027 | -264.502233 |
| 9 | 11 | 1984.640204 | -237.117823 |
| 10 | 12 | 1736.259096 | -248.381109 |
| 11 | 13 | 1557.090402 | -179.168694 |
| 12 | 14 | 1419.056056 | -138.034346 |
| 13 | 15 | 1330.860129 | -88.195927 |
| 14 | 16 | 1256.262353 | -74.597776 |
| 15 | 17 | 1196.740993 | -59.521360 |
| 16 | 18 | 1091.409427 | -105.331566 |
| 17 | 19 | 1047.344549 | -44.064877 |
| 18 | 20 | 1005.932456 | -41.412093 |
| 19 | 21 | 940.935903 | -64.996553 |
| 20 | 22 | 905.209732 | -35.726171 |
| 21 | 23 | 875.352466 | -29.857266 |
| 22 | 24 | 828.050686 | -47.301780 |
| 23 | 25 | 800.879985 | -27.170701 |
| 24 | 26 | 765.194478 | -35.685507 |
| 25 | 27 | 723.363954 | -41.830524 |
| 26 | 28 | 707.350124 | -16.013829 |
| 27 | 29 | 664.491783 | -42.858342 |
| 28 | 30 | 642.152694 | -22.339089 |
diff_vals = ssd_data['DIFF'][1:]
k_vals = ssd_data['K-Values'][1:]
fig()
sns.barplot(x=k_vals, y=diff_vals);
plt.xlabel('K Value')
plt.show()
# compute the percent difference for increasing from K to K+1 clusters
ssd_data['Percent_Diff'] = round(100.0 * ssd_data['DIFF'] / ssd_data['SSD'], 1)
ssd_data
| K-Values | SSD | DIFF | Percent_Diff | |
|---|---|---|---|---|
| 0 | 2 | 4856.751461 | NaN | NaN |
| 1 | 3 | 4253.502170 | -603.249291 | -14.2 |
| 2 | 4 | 3873.694307 | -379.807863 | -9.8 |
| 3 | 5 | 3583.929902 | -289.764405 | -8.1 |
| 4 | 6 | 3295.648329 | -288.281574 | -8.7 |
| 5 | 7 | 2970.091455 | -325.556874 | -11.0 |
| 6 | 8 | 2738.463991 | -231.627464 | -8.5 |
| 7 | 9 | 2486.260260 | -252.203731 | -10.1 |
| 8 | 10 | 2221.758027 | -264.502233 | -11.9 |
| 9 | 11 | 1984.640204 | -237.117823 | -11.9 |
| 10 | 12 | 1736.259096 | -248.381109 | -14.3 |
| 11 | 13 | 1557.090402 | -179.168694 | -11.5 |
| 12 | 14 | 1419.056056 | -138.034346 | -9.7 |
| 13 | 15 | 1330.860129 | -88.195927 | -6.6 |
| 14 | 16 | 1256.262353 | -74.597776 | -5.9 |
| 15 | 17 | 1196.740993 | -59.521360 | -5.0 |
| 16 | 18 | 1091.409427 | -105.331566 | -9.7 |
| 17 | 19 | 1047.344549 | -44.064877 | -4.2 |
| 18 | 20 | 1005.932456 | -41.412093 | -4.1 |
| 19 | 21 | 940.935903 | -64.996553 | -6.9 |
| 20 | 22 | 905.209732 | -35.726171 | -3.9 |
| 21 | 23 | 875.352466 | -29.857266 | -3.4 |
| 22 | 24 | 828.050686 | -47.301780 | -5.7 |
| 23 | 25 | 800.879985 | -27.170701 | -3.4 |
| 24 | 26 | 765.194478 | -35.685507 | -4.7 |
| 25 | 27 | 723.363954 | -41.830524 | -5.8 |
| 26 | 28 | 707.350124 | -16.013829 | -2.3 |
| 27 | 29 | 664.491783 | -42.858342 | -6.4 |
| 28 | 30 | 642.152694 | -22.339089 | -3.5 |
diff_perc = ssd_data['Percent_Diff'][1:]
temp = diff_perc.sort_values()
temp
10 -14.3 1 -14.2 8 -11.9 9 -11.9 11 -11.5 5 -11.0 7 -10.1 2 -9.8 12 -9.7 16 -9.7 4 -8.7 6 -8.5 3 -8.1 19 -6.9 13 -6.6 27 -6.4 14 -5.9 25 -5.8 22 -5.7 15 -5.0 24 -4.7 17 -4.2 18 -4.1 20 -3.9 28 -3.5 21 -3.4 23 -3.4 26 -2.3 Name: Percent_Diff, dtype: float64
fig()
sns.barplot(x=k_vals, y=diff_perc);
What are some of the possible cut-off values for K?
You could say that there is a significant drop off in SSD difference at K=3 (although we can see it continues to drop off past this). What would an analysis look like for K=3?
# KMeans model with 3 clusters
K = 3
model = KMeans(n_clusters=3)
# determine labels
labels = model.fit_predict(X_scaled)
labels
array([1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,
2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 0, 2,
1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1,
2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2,
2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,
2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1,
2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2,
1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2,
1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 1, 1])
# check correlation with clusters
X_temp = X.copy()
X_temp['Cluster K3'] = df_countries['Cluster K3']
X_temp.corr()['Cluster K3'].iloc[:-1].sort_values()
Region_SUB-SAHARAN AFRICA -0.878875 Birthrate -0.741783 Deathrate -0.706757 Agriculture -0.616936 Other (%) -0.119293 Service -0.049691 Industry -0.049691 Climate -0.049691 Population -0.049691 Literacy (%) -0.049691 Infant mortality (per 1000 births) -0.049691 Net migration -0.049691 Phones (per 1000) -0.049691 Area (sq. mi.) -0.006349 Region_ASIA (EX. NEAR EAST) 0.016087 Region_BALTICS 0.064467 Crops (%) 0.078755 Region_NORTHERN AFRICA 0.083606 Region_NORTHERN AMERICA 0.083606 Arable (%) 0.096976 Pop. Density (per sq. mi.) 0.102032 Region_C.W. OF IND. STATES 0.131643 Region_EASTERN EUROPE 0.131643 Coastline (coast/area ratio) 0.149214 Region_NEAR EAST 0.153462 Region_OCEANIA 0.177965 Region_WESTERN EUROPE 0.204844 Region_LATIN AMER. & CARIB 0.277430 GDP ($ per capita) 0.428175 Name: Cluster K3, dtype: float64
fig()
sns.heatmap(X_temp.corr());
X_temp.corr()['Cluster K3'][:-3].sort_values().plot(kind='bar')
<AxesSubplot:>
# check correlation with 6 clusters
X_temp['Cluster K6'] = df_countries['Cluster K6']
X_temp.corr()['Cluster K6'].iloc[:-1].sort_values()
Region_SUB-SAHARAN AFRICA -0.579056 Deathrate -0.538082 Birthrate -0.220146 Region_WESTERN EUROPE -0.216382 Agriculture -0.140867 Region_EASTERN EUROPE -0.139057 Arable (%) -0.088771 Region_NORTHERN AMERICA -0.088315 Region_BALTICS -0.068098 Other (%) -0.064928 Region_C.W. OF IND. STATES -0.038252 Area (sq. mi.) -0.034666 GDP ($ per capita) -0.012365 Region_NORTHERN AFRICA 0.007717 Region_NEAR EAST 0.014165 Region_LATIN AMER. & CARIB 0.025607 Service 0.102630 Population 0.102630 Climate 0.102630 Phones (per 1000) 0.102630 Literacy (%) 0.102630 Infant mortality (per 1000 births) 0.102630 Net migration 0.102630 Industry 0.102630 Pop. Density (per sq. mi.) 0.141202 Coastline (coast/area ratio) 0.211729 Crops (%) 0.263228 Region_OCEANIA 0.425257 Cluster K3 0.458659 Region_ASIA (EX. NEAR EAST) 0.705637 Name: Cluster K6, dtype: float64
# check correlation with 14 clusters
X_temp['Cluster K14'] = df_countries['Cluster K14']
X_temp.corr()['Cluster K14'].iloc[:-1].sort_values()
Region_LATIN AMER. & CARIB -0.513716 Region_WESTERN EUROPE -0.440338 GDP ($ per capita) -0.291774 Deathrate -0.118892 Other (%) -0.101330 Region_SUB-SAHARAN AFRICA -0.075015 Region_NORTHERN AMERICA -0.070745 Arable (%) -0.065513 Service -0.065042 Industry -0.065042 Climate -0.065042 Population -0.065042 Phones (per 1000) -0.065042 Literacy (%) -0.065042 Infant mortality (per 1000 births) -0.065042 Net migration -0.065042 Region_EASTERN EUROPE 0.020978 Area (sq. mi.) 0.030562 Cluster K3 0.054815 Region_BALTICS 0.075097 Birthrate 0.127060 Agriculture 0.141333 Region_ASIA (EX. NEAR EAST) 0.164758 Pop. Density (per sq. mi.) 0.196948 Coastline (coast/area ratio) 0.210527 Region_NEAR EAST 0.255921 Crops (%) 0.297088 Region_NORTHERN AFRICA 0.307561 Region_C.W. OF IND. STATES 0.351904 Cluster K6 0.415404 Region_OCEANIA 0.454429 Name: Cluster K14, dtype: float64
The best way to interpret this model is through visualizing the clusters of countries on a real map of the world. Our goal is to create cluster labels for a chosen K value and display them on the map. Based on the solutions, we believe either K=3, K=6, or K=14 could be reasonable choices.
# we stored our results in the feature data with new columns for each set of labels
# produced by our different K-Means models: K={3, 6, 14}
df_countries[['Cluster K3', 'Cluster K6', 'Cluster K14']]
| Cluster K3 | Cluster K6 | Cluster K14 | |
|---|---|---|---|
| 0 | 0 | 5 | 6 |
| 1 | 2 | 1 | 5 |
| 2 | 2 | 2 | 12 |
| 3 | 2 | 4 | 9 |
| 4 | 2 | 1 | 0 |
| ... | ... | ... | ... |
| 221 | 2 | 4 | 9 |
| 222 | 2 | 2 | 8 |
| 224 | 2 | 2 | 8 |
| 225 | 0 | 0 | 4 |
| 226 | 0 | 0 | 4 |
224 rows × 3 columns
# install these modules if you do not have them already
# pip install "notebook>=5.3" "ipywidgets>=7.5"
# install plotly library
# pip install plotly==4.14.3
We will plot out these clusters on a country level choropleth map:
Make to have the plotly library installed: https://plotly.com/python/getting-started/
Example of how to create a geographical choropleth map using plotly: https://plotly.com/python/choropleth-maps/#using-builtin-country-and-state-geometries
We will need ISO country codes for this: either use the Wikipedia page, or use our provided file from the data folder in this repository this: "./data/country_iso_codes.csv"
Combine the cluster labels, ISO Codes, and Country Names to create a world map plot with the plotly library
# load the country ISO codes
df_iso = pd.read_csv('./data/country_iso_codes.csv')
df_iso
| Country | ISO Code | |
|---|---|---|
| 0 | Afghanistan | AFG |
| 1 | Akrotiri and Dhekelia – See United Kingdom, The | Akrotiri and Dhekelia – See United Kingdom, The |
| 2 | Åland Islands | ALA |
| 3 | Albania | ALB |
| 4 | Algeria | DZA |
| ... | ... | ... |
| 296 | Congo, Dem. Rep. | COD |
| 297 | Congo, Repub. of the | COG |
| 298 | Tanzania | TZA |
| 299 | Central African Rep. | CAF |
| 300 | Cote d'Ivoire | CIV |
301 rows × 2 columns
# convert the ISO codes to a dictionary
iso_map = df_iso.set_index('Country')['ISO Code'].to_dict()
# create a copy of our DataFrame that still contains the "Country" column
df_map = df_countries.copy()
# create a new column for the ISO codes
df_map['ISO CODE'] = df_countries['Country'].map(iso_map)
df_map
| Country | Region | Population | Area (sq. mi.) | Pop. Density (per sq. mi.) | Coastline (coast/area ratio) | Net migration | Infant mortality (per 1000 births) | GDP ($ per capita) | Literacy (%) | ... | Cluster K22 | Cluster K23 | Cluster K24 | Cluster K25 | Cluster K26 | Cluster K27 | Cluster K28 | Cluster K29 | Cluster K30 | ISO CODE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | ASIA (EX. NEAR EAST) | 31056997 | 647500 | 48.0 | 0.00 | 31056997.0 | 31056997.0 | 700.0 | 31056997.0 | ... | 6 | 9 | 5 | 19 | 7 | 1 | 20 | 26 | 27 | AFG |
| 1 | Albania | EASTERN EUROPE | 3581655 | 28748 | 124.6 | 1.26 | 3581655.0 | 3581655.0 | 4500.0 | 3581655.0 | ... | 9 | 0 | 6 | 7 | 5 | 11 | 7 | 9 | 6 | ALB |
| 2 | Algeria | NORTHERN AFRICA | 32930091 | 2381740 | 13.8 | 0.04 | 32930091.0 | 32930091.0 | 6000.0 | 32930091.0 | ... | 14 | 14 | 13 | 11 | 13 | 8 | 8 | 10 | 7 | DZA |
| 3 | American Samoa | OCEANIA | 57794 | 199 | 290.4 | 58.29 | 57794.0 | 57794.0 | 8000.0 | 57794.0 | ... | 10 | 19 | 21 | 8 | 9 | 12 | 12 | 8 | 10 | ASM |
| 4 | Andorra | WESTERN EUROPE | 71201 | 468 | 152.1 | 0.00 | 71201.0 | 71201.0 | 19000.0 | 71201.0 | ... | 0 | 6 | 4 | 4 | 4 | 15 | 5 | 4 | 19 | AND |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 221 | Wallis and Futuna | OCEANIA | 16025 | 274 | 58.5 | 47.08 | 16025.0 | 16025.0 | 3700.0 | 16025.0 | ... | 10 | 7 | 21 | 8 | 9 | 12 | 12 | 8 | 9 | WLF |
| 222 | West Bank | NEAR EAST | 2460492 | 5860 | 419.9 | 0.00 | 2460492.0 | 2460492.0 | 800.0 | 2460492.0 | ... | 5 | 4 | 10 | 10 | 25 | 23 | 0 | 19 | 11 | NaN |
| 224 | Yemen | NEAR EAST | 21456188 | 527970 | 40.6 | 0.36 | 21456188.0 | 21456188.0 | 800.0 | 21456188.0 | ... | 16 | 4 | 10 | 10 | 17 | 0 | 14 | 7 | 25 | YEM |
| 225 | Zambia | SUB-SAHARAN AFRICA | 11502010 | 752614 | 15.3 | 0.00 | 11502010.0 | 11502010.0 | 800.0 | 11502010.0 | ... | 20 | 20 | 18 | 24 | 24 | 22 | 25 | 6 | 2 | ZMB |
| 226 | Zimbabwe | SUB-SAHARAN AFRICA | 12236805 | 390580 | 31.3 | 0.00 | 12236805.0 | 12236805.0 | 1900.0 | 12236805.0 | ... | 20 | 20 | 18 | 24 | 24 | 22 | 25 | 6 | 2 | ZWE |
224 rows × 50 columns
# import the choropleth plot from plotly
import plotly.express as px
# create map with our 3 clusters
f = px.choropleth(df_map,
locations="ISO CODE",
color="Cluster K3",
hover_name="Country", # column to add to hover information
color_continuous_scale=px.colors.sequential.Plasma)
f.show()
# create map with our 6 clusters
f = px.choropleth(df_map,
locations="ISO CODE",
color="Cluster K6",
hover_name="Country", # column to add to hover information
color_continuous_scale=px.colors.sequential.Plasma)
f.show()
# create map with our 14 clusters
f = px.choropleth(df_map,
locations="ISO CODE",
color="Cluster K14",
hover_name="Country", # column to add to hover information
color_continuous_scale=px.colors.sequential.Plasma)
f.show()